Smart Paper Notes

Learning policies that effectively utilize language instructions in complex, multi-task environments is an important problem in sequential decision-making. While it is possible to condition on the entire language instruction directly, such an approach could suffer from generalization issues. In our work, we propose Learning Interpretable Skill Abstractions (LISA), a hierarchical imitation learning framework that can learn diverse, interpretable primitive behaviors or skills from language-conditioned demonstrations to better generalize to unseen instructions. LISA uses vector quantization to learn discrete skill codes that are highly correlated with language instructions and the behavior of the learned policy. In navigation and robotic manipulation environments, LISA outperforms a strong non-hierarchical Decision Transformer baseline in the low data regime and is able to compose learned skills to solve tasks containing unseen long-range instructions. Our method demonstrates a more natural way to condition on language in sequential decision-making problems and achieves interpretable and controllable behavior with the learned skills.
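As a rough illustration of the vector quantization idea mentioned in the abstract, the sketch below maps a continuous vector to the nearest entry of a small codebook and returns its discrete index. This is only a minimal sketch of generic nearest-neighbor quantization, not LISA's implementation: the function names `squared_distance` and `quantize`, the two-dimensional codebook, and its values are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    vector: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index and value of the codebook entry closest to `vector`.

    The index plays the role of a discrete code; in a learned system the
    codebook entries would be trained, here they are fixed by hand.
    """
    best_index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return best_index, list(codebook[best_index])


# Hypothetical codebook with three 2-D entries (assumption for illustration).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

index, code = quantize([0.9, 1.2], codebook)
print(index, code)
```

In this toy setup, each continuous input is replaced by one of three discrete codes, which is what makes the resulting representation easy to inspect.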

translated by Google Translate

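Spinal cord injuries often leave patients quadriplegic, limiting their mobility. A wheelchair can be a good option for such patients, but most wheelchairs are operated either manually or by an electric motor controlled with a joystick. Both require the use of the hands, which makes them unsuitable for quadriplegic patients. Control of eye movement, on the other hand, is retained even by people who have suffered brain injury. Monitoring eye movements can therefore be a useful tool for generating wheelchair control signals. This paper presents an approach to converting the signals obtained from the eye into meaningful commands by using them to control a robot that imitates a wheelchair. The overall system is cost-effective and uses simple image processing and pattern recognition to control the robot. An Android application was also developed, which the patient's aide can use for more refined control of the wheelchair in a real-world setting.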

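Self-supervised learning (SSL) promises to leverage large amounts of unlabeled data to solve complex tasks. While simple, image-level learning has made excellent progress, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategies, these methods sacrifice the simplicity and generality that make SSL so powerful. Instead, we propose a self-supervised learning paradigm that discovers this image structure by itself. Our method, Odin, couples object discovery and representation networks to discover meaningful image segmentations without any supervision. The resulting learning paradigm is simpler, less brittle, and more general, and achieves state-of-the-art transfer learning results for object detection and instance segmentation on COCO, and for semantic segmentation on PASCAL and Cityscapes, while surpassing supervised pre-training for video segmentation on DAVIS.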

HiP: Hierarchical Perceiver

Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals

Category: Computer Vision

2022-02-22

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum, our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, 3) demonstrating competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40 and Kinetics datasets with the same exact, unchanged model and without specialized preprocessing or any tokenization.
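To make the masked auto-encoding idea in contribution 2 more concrete, the sketch below shows only the generic masking step: a random subset of input positions is hidden, and the hidden values become the reconstruction targets. It is a minimal sketch under stated assumptions, not the HiP training procedure: the function name `mask_inputs`, the `MASK` placeholder, the mask ratio, and the toy signal are all hypothetical, and no model or positional embedding is trained here.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical placeholder marking a hidden input (assumption for illustration).
MASK = None


def mask_inputs(
    values: Sequence[float], mask_ratio: float, seed: int = 0
) -> Tuple[List[Optional[float]], List[int], List[float]]:
    """Hide a random fraction of `values`.

    Returns the corrupted sequence, the sorted masked positions, and the
    original values at those positions (the reconstruction targets).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so the example is reproducible
    num_masked = int(round(len(values) * mask_ratio))
    masked_positions = sorted(rng.sample(range(len(values)), num_masked))
    masked_set = set(masked_positions)
    corrupted = [MASK if i in masked_set else v for i, v in enumerate(values)]
    targets = [values[i] for i in masked_positions]
    return corrupted, masked_positions, targets


# Toy 1-D signal with ten samples (assumption for illustration).
signal = [0.1 * i for i in range(10)]
corrupted, positions, targets = mask_inputs(signal, mask_ratio=0.3, seed=1)
print(corrupted)
print(positions, targets)
```

A model trained this way would receive `corrupted` together with position information and be asked to predict `targets` at `positions`; that prediction step is outside the scope of this sketch.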
